Prefilter

2021-04-19

Introduction

Large and heterogeneous datasets may contain thousands of records missing spatial or taxonomic information (partially or entirely), as well as records outside a region of interest or from doubtful sources. Such lower-quality data are not fit for use in many research applications without prior amendment. The ‘Pre-filter’ step contains a series of tests to detect, remove, and, whenever possible, correct such erroneous or suspect records.


Important:

The results of the VALIDATION tests used to flag data quality are appended as separate fields in the database and stored as TRUE or FALSE: TRUE indicates correct records, and FALSE indicates potentially problematic or suspect records.
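
As a minimal sketch of this convention, the toy data frame below mimics two flag columns. The column names follow the test names listed in the report later in this document; the records themselves are invented for illustration. Counting the FALSE values gives the number of flagged records per test.

```r
# Toy records; values are invented for illustration only.
toy <- data.frame(
  scientificName = c("Panthera onca", NA, "Puma concolor"),
  .scientificName_empty = c(TRUE, FALSE, TRUE),
  .coordinates_empty = c(TRUE, TRUE, FALSE),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Flag columns start with "."; FALSE marks a potentially problematic record.
flag_cols <- grep("^\\.", names(toy), value = TRUE)

# Number of flagged (FALSE) records in each flag column
n_flagged <- vapply(toy[flag_cols], function(x) sum(!x), integer(1))
n_flagged
```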

Installation

You can install the released version of ‘BDC’ from GitHub with:

if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

Creating folders to save the results

bdc::bdc_create_dir()

Read the database

Read the merged database created in the step ‘Standardization and integration of different datasets’ of the BDC workflow. It is also possible to read any dataset containing the fields required to run the workflow.

database <-
  qs::qread("Output/Intermediate/00_merged_database.qs")

Standardization of character encoding

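# Declare UTF-8 as the encoding of every character column
for (i in seq_len(ncol(database))) {
  # [[i]] extracts the column as a vector for data frames and tibbles alike
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}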
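
Note that `Encoding<-` only declares how the bytes of a string should be interpreted; it does not convert them. The base R sketch below assumes, purely for illustration, that some input was originally stored as Latin-1 (nothing in this document says so about the example dataset). It shows how `validUTF8()` can reveal such strings and how `iconv()` converts them.

```r
# "\xe9" is the Latin-1 byte for "é"; on its own it is not valid UTF-8.
x <- c("Brasil", "M\xe9xico")

# Which strings are valid UTF-8 as stored?
valid_before <- validUTF8(x)

# Convert from the assumed source encoding (Latin-1) to UTF-8.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
valid_after <- validUTF8(x_utf8)
```

If `validUTF8()` returns FALSE for some values after the encoding has been declared, converting from the actual source encoding is likely a safer route than relabelling.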



1 - Records missing species names

VALIDATION. This test flags records missing species names.

check_pf <- bdc_scientificName_empty(
  data = database,
  sci_name = "scientificName")
#> 
#> bdc_scientificName_empty:
#> Flagged 324 records.
#> One column was added to the database.
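
To make the logic of this test concrete, here is a base R sketch of how such a flag could be computed. It is an illustration under a simple assumption (a name counts as empty if it is NA or contains only whitespace), not the implementation used by `bdc_scientificName_empty()`; the helper name `flag_name_present` is hypothetical.

```r
# Hypothetical helper: TRUE = name present, FALSE = missing name
flag_name_present <- function(x) {
  # A name is treated as missing if it is NA, "", or whitespace only
  !is.na(x) & nzchar(trimws(x))
}

names_toy <- c("Panthera onca", "", NA, "   ")
name_flags <- flag_name_present(names_toy)
name_flags
```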

2 - Records lacking information on geographic coordinates

VALIDATION. This test flags records whose geographic coordinates are partially or completely missing.

check_pf <- bdc_coordinates_empty(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> 
#> bdc_coordinates_empty:
#> Flagged 1921 records.
#> One column was added to the database.
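
A record can lack both coordinates or only one of them. The sketch below shows one way to express that rule in base R; the helper name `flag_coords_present` is hypothetical and the function is not taken from the package.

```r
# Hypothetical helper: TRUE only when both latitude and longitude are available
flag_coords_present <- function(lat, lon) {
  !is.na(lat) & !is.na(lon)
}

lat <- c(-15.8, NA, -3.1, NA)
lon <- c(-47.9, -60.0, NA, NA)

# Records 2 and 3 are partially missing, record 4 is completely missing
coord_flags <- flag_coords_present(lat, lon)
coord_flags
```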

3 - Records with out-of-range coordinates

VALIDATION. This test flags records with out-of-range coordinates, that is, latitude > 90 or < -90, or longitude > 180 or < -180.

check_pf <- bdc_coordinates_outOfRange(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude")
#> 
#> bdc_coordinates_outOfRange:
#> Flagged 23 records.
#> One column was added to the database.
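
The range rule can be written compactly with absolute values. The following is a sketch, not the package's code: the helper name `flag_coords_in_range` is hypothetical, and leaving missing coordinates unflagged is a choice made here so that the sketch reports range problems only.

```r
# Hypothetical helper: TRUE when latitude and longitude are within valid ranges
flag_coords_in_range <- function(lat, lon) {
  in_range <- abs(lat) <= 90 & abs(lon) <= 180
  # Assumption of this sketch: undecidable (NA) cases are not flagged here
  in_range[is.na(in_range)] <- TRUE
  in_range
}

lat <- c(-15.8, 95, 10, NA)
lon <- c(-47.9, -60, 200, -50)
range_flags <- flag_coords_in_range(lat, lon)
range_flags
```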

4 - Records from doubtful sources

VALIDATION. This test flags records from doubtful sources, for example, records from drawings, photographs, multimedia objects, or fossils, among others.

check_pf <- bdc_basisOfRrecords_notStandard(
  data = check_pf,
  basisOfRecord = "basisOfRecord",
  names_to_keep = "all")
#> 
#> bdc_basisOfRrecords_notStandard:
#> Flagged 5 of the following specific nature:
#>  c("FOSSIL_SPECIMEN", "Extra", "Liqui") 
#> One column was added to the database.
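
Conceptually, this kind of test compares each record's basis of record with a set of accepted values. The sketch below illustrates the idea in base R. The helper name `flag_basis_ok`, the accepted values in `keep`, and the decision not to flag missing values are all assumptions made for illustration; they do not describe how `names_to_keep = "all"` is resolved by the package.

```r
# Hypothetical helper: TRUE when the basis of record is among the accepted values
flag_basis_ok <- function(basis, keep) {
  # Case-insensitive comparison; missing values are not flagged in this sketch
  is.na(basis) | toupper(basis) %in% toupper(keep)
}

# Illustrative choice of accepted values (an assumption, not the package default)
keep <- c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")
basis <- c("PRESERVED_SPECIMEN", "FOSSIL_SPECIMEN", "human_observation", NA)

basis_flags <- flag_basis_ok(basis, keep)
basis_flags
```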

5 - Getting country names from valid coordinates

ENRICHMENT. This step derives country names from valid coordinates for records that lack a country name.

check_pf <- bdc_country_from_coordinates(
  data = check_pf,
  lat = "decimalLatitude",
  lon = "decimalLongitude",
  country = "country")
#> 
#> bdc_country_from_coordinates:
#> Country names were added to 1123 records.

6 - Standardizing country names and getting country code information

ENRICHMENT. Country names are standardized against a list of country names in several languages retrieved from Wikipedia.

check_pf <- bdc_country_standardized(
  data = check_pf,
  country = "country"
)
#> Loading auxiliary data: country names from wikipedia
#> Loading auxiliary data: world map and country iso
#> Standardizing country names
#> country found: Argentina
#> country found: Belize
#> country found: Bolivia
#> country found: Brazil
#> country found: Colombia
#> country found: Ecuador
#> country found: France
#> country found: French Guiana
#> country found: Guyana
#> country found: Honduras
#> country found: Japan
#> country found: Mexico
#> country found: Nicaragua
#> country found: Paraguay
#> country found: Suriname
#> country found: Uruguay
#> country found: Venezuela
#> 
#> bdc_country_standardized:
#> The country names of 8540 records were standardized.
#> Two columns were added to the database.
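
The idea of standardizing against a multilingual list can be sketched as a lookup-table match. The table below is a tiny hand-made stand-in for the Wikipedia-derived list, and `standardize_country` is a hypothetical helper; both are assumptions for illustration only.

```r
# Tiny stand-in for a multilingual list of country name variants
lookup <- data.frame(
  variant = c("brazil", "brasil", "brasilien", "mexico", "mexiko"),
  standard = c("Brazil", "Brazil", "Brazil", "Mexico", "Mexico"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map each name to its standard form, NA if unknown
standardize_country <- function(x, lookup) {
  key <- tolower(trimws(x))
  lookup$standard[match(key, lookup$variant)]
}

std_names <- standardize_country(c("Brasil", " MEXICO ", "Atlantis"), lookup)
std_names
```

Names that match no variant come back as NA in this sketch, which makes unmatched values easy to inspect.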

7 - Correcting transposed latitude and longitude

AMENDMENT. A mismatch between the reported country and the coordinates can result from coordinates with the wrong sign or from transposed coordinates. Once a mismatch is detected, different coordinate transformations are applied to try to resolve it. Verbatim coordinates are then replaced by the rectified ones in the returned database (a database containing both verbatim and corrected coordinates is also created in the “Output” folder).

check_pf <-
  bdc_coordinates_transposed(
    data = check_pf,
    id = "database_id",
    sci_names = "scientificName",
    lat = "decimalLatitude",
    lon = "decimalLongitude",
    country = "country",
    countryCode = "countryCode", 
    border_buffer = 0.2 # in decimal degrees (~22 km at the equator)
  )
#> Correcting latitude and longitude transposed
#> Testing coordinate validity
#> Removed 1522 records.
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing sea coordinates
#> Flagged 704 records.
#> Testing country identity
#> Flagged 716 records.
#> Flagged 716 of 7018 records, EQ = 0.1.
#> 716 ocurrences will be tested
#> Processing occurrences from: BR (713)
#> Processing occurrences from: CO (1)
#> Processing occurrences from: MX (1)
#> Processing occurrences from: VE (1)
#> 
#> bdc_coordinates_transposed:
#> Corrected 19 records.
#> One columns were added to the database.
#> Check database containing coordinates corrected in:
#> Output/Check/01_coordinates_transposed.csv
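
The coordinate transformations mentioned above can be pictured as the eight combinations of sign changes and latitude/longitude swaps. The sketch below generates them and keeps the candidates that fall inside the reported country. It is a simplified illustration: `candidate_coords` and `in_bbox` are hypothetical helpers, and a rough bounding box for Brazil stands in for a real country polygon with a border buffer.

```r
# Hypothetical helper: all sign-flipped and swapped versions of one coordinate
candidate_coords <- function(lat, lon) {
  signs <- expand.grid(s_lat = c(1, -1), s_lon = c(1, -1))
  as_is <- data.frame(lat = signs$s_lat * lat, lon = signs$s_lon * lon)
  swapped <- data.frame(lat = signs$s_lat * lon, lon = signs$s_lon * lat)
  rbind(as_is, swapped)
}

# Hypothetical helper: is each point inside a rectangular bounding box?
in_bbox <- function(lat, lon, bbox) {
  lat >= bbox[["lat_min"]] & lat <= bbox[["lat_max"]] &
    lon >= bbox[["lon_min"]] & lon <= bbox[["lon_max"]]
}

# Approximate bounding box for Brazil (an assumption standing in for a polygon)
brazil_bbox <- c(lat_min = -34, lat_max = 6, lon_min = -74, lon_max = -34)

# A record reported for Brazil whose latitude and longitude were swapped
cand <- candidate_coords(lat = -47.9, lon = -15.8)
fixed <- cand[in_bbox(cand$lat, cand$lon, brazil_bbox), ]
fixed
```

In this example only the swapped candidate falls inside the box. With a real polygon, more than one candidate could match, so a rule for choosing among them would be needed.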

8 - Records outside a region of interest

VALIDATION. This test flags records outside one or multiple reference countries, i.e., records in other countries or farther than a specified distance from the coast (e.g., in the ocean). The distance buffer avoids flagging records close to country limits as invalid (e.g., records of coastal or marshland species).

check_pf <-
  bdc_coordinates_country_inconsistent(
    data = check_pf,
    country_name = "Brazil",
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    dist = 0.1 # in decimal degrees (~11 km at the equator)
  )
#> dist is assumed to be in decimal degrees (arc_degrees).
#> although coordinates are longitude/latitude, st_intersection assumes that they are planar
#> 
#> bdc_coordinates_country_inconsistent:
#> Flagged 658 records.
#> One column was added to the database.
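
The buffer distances in this workflow are given in decimal degrees, with an approximate equivalent in km at the equator. The sketch below reproduces that conversion with a spherical approximation based on the WGS84 equatorial radius; `deg_to_km` is a hypothetical helper written for this illustration. It also shows that a degree of longitude covers less ground away from the equator.

```r
# Length of one degree of arc at the equator (WGS84 equatorial radius in km)
km_per_degree_equator <- 2 * pi * 6378.137 / 360

# Hypothetical helper: east-west length of `deg` degrees of longitude
# at a given latitude, using a spherical approximation
deg_to_km <- function(deg, latitude = 0) {
  deg * km_per_degree_equator * cos(latitude * pi / 180)
}

deg_to_km(0.1)                  # about 11 km at the equator
deg_to_km(0.2)                  # about 22 km at the equator
deg_to_km(0.1, latitude = -30)  # shorter away from the equator
```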

9 - Save records not geo-referenced but with locality information

ENRICHMENT. Coordinates can be derived from a detailed description of the locality associated with records in a process called retrospective geo-referencing.

xyFromLocality <- bdc_coordinates_from_locality(
  data = check_pf,
  locality = "locality",
  lon = "decimalLongitude",
  lat = "decimalLatitude"
)
#> 
#> bdc_coordinates_from_locality 
#> Found 1944 records missing or with invalid coordinates but with potentially useful information on locality.
#>  
#> Check database in: C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Check/01_coordinates_from_locality.csv
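
The selection rule reported in the output above (records with missing or invalid coordinates but with locality information) can be sketched in base R as follows. The helper name `needs_georef` and the toy records are hypothetical; what counts as "potentially useful" locality information in the package may differ from the simple non-empty check used here.

```r
# Hypothetical helper: TRUE for records worth sending to retrospective
# geo-referencing (bad or missing coordinates, but a non-empty locality)
needs_georef <- function(lat, lon, locality) {
  coords_bad <- is.na(lat) | is.na(lon) | abs(lat) > 90 | abs(lon) > 180
  has_locality <- !is.na(locality) & nzchar(trimws(locality))
  coords_bad & has_locality
}

toy <- data.frame(
  decimalLatitude = c(-15.8, NA, NA),
  decimalLongitude = c(-47.9, NA, -60),
  locality = c("Brasilia", "Manaus, near the river port", NA),
  stringsAsFactors = FALSE
)

to_georef <- toy[needs_georef(toy$decimalLatitude,
                              toy$decimalLongitude,
                              toy$locality), ]
to_georef
```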

Report

This step creates a column named “.summary” that summarizes the results of all VALIDATION tests. A record receives “FALSE” in this column if any test flagged it as “FALSE” (i.e., a potentially invalid or suspect record).

check_pf <- bdc_summary_col(data = check_pf)
#> 
#> bdc_summary_col:
#> Flagged 2888 records.
#> One column was added to the database.
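
The summary rule is a row-wise logical AND over the test columns. The base R sketch below illustrates it on invented flag values; it assumes that test columns are exactly those whose names start with "." and is not the code of `bdc_summary_col()`.

```r
# Invented flag values for three records and three tests
toy <- data.frame(
  .scientificName_empty = c(TRUE, FALSE, TRUE),
  .coordinates_empty = c(TRUE, TRUE, FALSE),
  .coordinates_outOfRange = c(TRUE, TRUE, TRUE),
  check.names = FALSE
)

# Test columns are assumed to be the ones starting with "."
test_cols <- grep("^\\.", names(toy), value = TRUE)

# A record passes only if it passed every test
toy$.summary <- Reduce(`&`, toy[test_cols])

# Number of records flagged by at least one test
n_flagged_any <- sum(!toy$.summary)
n_flagged_any
```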



Creating a report summarizing the results of all tests.

report <-
  bdc_create_report(data = check_pf,
                    database_id = "database_id",
                    workflow_step = "prefilter")
#> 
#> bdc_create_report:
#> Check the report summarizing the results of the prefilter in:
#> Output/Report

report

| Description | Test_name | Records_flagged | perc_number_records(*) |
|---|---|---|---|
| Records with empty scientific name | .scientificName_empty | 324 | 3.6 |
| Records with empty coordinates | .coordinates_empty | 1921 | 21.34 |
| Records coordiantes out-of-range | .coordinates_outOfRange | 23 | 0.26 |
| Records from doubtful source | .basisOfRrecords_notStandard | 5 | 0.06 |
| Records outside one or multiple reference countries | .coordinates_country_inconsistent | 658 | 7.31 |
| Summary of all tests | .summary | 2888 | 32.09 |

(*) calculated in relation to total number of records, i.e. 9000 records
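
The percentages in the report can be reproduced from the flagged counts and the total of 9000 records. The short sketch below does the arithmetic; the variable names are chosen for illustration.

```r
# Flagged counts as listed in the report, in the same order
flagged <- c(324, 1921, 23, 5, 658, 2888)
total <- 9000

# Percentage of all records, rounded to two decimals
perc <- round(100 * flagged / total, 2)
perc

# Sum of the five individual tests (summary row excluded)
sum_individual <- sum(flagged[1:5])
sum_individual
```

The five individual tests add up to 2931, which is more than the 2888 records in the summary row. This is consistent with some records being flagged by more than one test.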


Figures

Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests.

bdc_create_figures(data = check_pf,
                   database_id = "database_id",
                   workflow_step = "prefilter")
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures

Figure: Transposed coordinates

Figure: Coordinates and country inconsistent

Figure: Summary of all tests


Filter the database

It is possible to remove flagged records (potentially problematic ones) to get a ‘clean’ database (i.e., without test columns starting with “.”). However, to ensure that all records are evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal steps of the workflow), potentially erroneous or suspect records are removed only in the final step of the workflow.

# output <-
#   check_pf %>%
#   dplyr::filter(.summary == TRUE) %>%
#   bdc_filter_out_flags(data = ., col_to_remove = "all")
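
For readers who want to see what this filtering amounts to, the base R sketch below keeps records whose “.summary” is TRUE and then drops every column starting with ".". It works on invented data and is only an approximation of the commented pipeline above, under the assumption that test columns are identified by the leading ".".

```r
# Invented records with one test column and the summary column
toy <- data.frame(
  scientificName = c("Panthera onca", NA, "Puma concolor"),
  .scientificName_empty = c(TRUE, FALSE, TRUE),
  .summary = c(TRUE, FALSE, TRUE),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only records that passed all tests
kept <- toy[toy$.summary, , drop = FALSE]

# Drop the test columns (names starting with ".")
clean <- kept[, !startsWith(names(kept), "."), drop = FALSE]
clean
```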

Save the database

# Save the flagged database for the next step of the workflow
qs::qsave(
  check_pf,
  here::here("Output", "Intermediate", "01_prefilter_database.qs")
)